1 Introduction

This report presents an exploratory data analysis of the Housing Dataset. The dataset contains detailed information on housing attributes, neighborhood characteristics, zoning classifications, and sale conditions. Our initial work has used visualizations, such as the interactive correlation matrix, to identify trends and correlations among key variables that could influence sale prices.

The ultimate goal is to develop a predictive model for house prices. Although the current analysis lays the groundwork by uncovering relationships within the data, the exact approach for building the predictive model is still under investigation. Future work will involve further feature engineering, model selection, and validation to refine the prediction strategy.

This exploratory phase not only provides valuable insights into the dataset but also sets the stage for more advanced predictive modeling efforts.


2 How should we choose important variables?

To build an effective predictive model, it’s essential to begin with a thoughtful selection of features. Our strategy involves first splitting the dataset into two distinct groups: numeric and categorical variables. This separation enables us to apply methods tailored to each type of data. Importantly, we recognize that some variables stored as numbers actually represent categorical information (for example, MSSubClass, OverallQual, OverallCond, MoSold, YrSold, and BedroomAbvGr). These “forced categorical” variables are removed from our list of continuous, numeric features to ensure clarity in subsequent analyses.

2.1 Numeric Variables Selection

To identify the key numeric predictors, we first extracted all columns with numeric data types. Not every numeric column is a continuous measurement, so we dropped the forced categorical variables listed above (MSSubClass, OverallQual, OverallCond, MoSold, YrSold, and BedroomAbvGr). We then imputed missing values in the remaining columns with each variable’s median, computed the correlation matrix, and generated an interactive heatmap. The heatmap showed that the initial pool was too large: many numeric features are only weakly related to SalePrice. We therefore narrowed the numeric predictors to those with the strongest correlations; the nine variables retained in Section 2.1.1 all have a correlation with SalePrice above 0.5 in the output below.

##     SalePrice     GrLivArea    GarageCars    GarageArea   TotalBsmtSF 
##    1.00000000    0.70862448    0.64040920    0.62343144    0.61358055 
##     X1stFlrSF      FullBath  TotRmsAbvGrd     YearBuilt  YearRemodAdd 
##    0.60585218    0.56066376    0.53372316    0.52289733    0.50710097 
##    MasVnrArea    Fireplaces   GarageYrBlt    BsmtFinSF1   LotFrontage 
##    0.47261450    0.46692884    0.46675365    0.38641981    0.33477085 
##    WoodDeckSF     X2ndFlrSF   OpenPorchSF      HalfBath       LotArea 
##    0.32441344    0.31933380    0.31585623    0.28410768    0.26384335 
##  BsmtFullBath     BsmtUnfSF   ScreenPorch      PoolArea    X3SsnPorch 
##    0.22712223    0.21447911    0.11144657    0.09240355    0.04458367 
##    BsmtFinSF2  BsmtHalfBath       MiscVal            Id  LowQualFinSF 
##   -0.01137812   -0.01684415   -0.02118958   -0.02191672   -0.02560613 
## EnclosedPorch  KitchenAbvGr 
##   -0.12857796   -0.13590737
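The numeric screening steps described above can be sketched in base R as follows. This is a minimal illustration, not the report's actual code: the `housing` data frame is synthetic, and the 0.5 cutoff is an assumption chosen because the variables retained in Section 2.1.1 all exceed it.

```r
# Synthetic stand-in for the housing data (values are made up for illustration).
set.seed(1)
n <- 200
housing <- data.frame(
  GrLivArea   = round(rnorm(n, 1500, 400)),
  LotFrontage = round(rnorm(n, 70, 20)),
  MSSubClass  = sample(c(20, 60, 120), n, replace = TRUE),  # numeric code, really a category
  MoSold      = sample(1:12, n, replace = TRUE),
  Foundation  = sample(c("PConc", "CBlock"), n, replace = TRUE)
)
housing$SalePrice <- 50000 + 90 * housing$GrLivArea + rnorm(n, 0, 30000)
housing$LotFrontage[sample(n, 20)] <- NA  # introduce some missing values

forced_categorical <- c("MSSubClass", "OverallQual", "OverallCond",
                        "MoSold", "YrSold", "BedroomAbvGr")

# 1. Keep numeric columns, then drop the forced categorical ones.
is_num <- vapply(housing, is.numeric, logical(1))
numeric_vars <- setdiff(names(housing)[is_num], forced_categorical)
num_data <- housing[, numeric_vars, drop = FALSE]

# 2. Impute missing values with each column's median.
num_data[] <- lapply(num_data, function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
})

# 3. Correlation of every numeric variable with SalePrice, sorted descending.
cor_with_price <- sort(cor(num_data)[, "SalePrice"], decreasing = TRUE)
print(round(cor_with_price, 3))

# 4. Keep predictors above the cutoff (0.5 is an assumed threshold).
selected <- setdiff(names(cor_with_price)[cor_with_price > 0.5], "SalePrice")
print(selected)
```

Sorting the `SalePrice` column of the correlation matrix reproduces the layout of the output above, with `SalePrice` itself first at 1.0.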

2.1.1 Key Numerical Variables Influencing House Prices

Rank  Variable      Description                       Correlation with SalePrice
1     GrLivArea     Above ground living area (sq ft)  Strong positive (r = 0.71)
2     GarageCars    Number of garage spaces           Strong positive (r = 0.64)
3     GarageArea    Garage size (sq ft)               Strong positive (r = 0.62)
4     TotalBsmtSF   Total basement area (sq ft)       Strong positive (r = 0.61)
5     1stFlrSF      First floor area (sq ft)          Strong positive (r = 0.61)
6     FullBath      Number of full bathrooms          Moderate positive (r = 0.56)
7     TotRmsAbvGrd  Total rooms above ground          Moderate positive (r = 0.53)
8     YearBuilt     Year the house was built          Moderate positive (r = 0.52)
9     YearRemodAdd  Year of last remodeling           Moderate positive (r = 0.51)

Summary: The most important factors influencing house prices are total living space, basement size, and garage space. Other factors like bathroom count, total rooms, and construction year also contribute but to a lesser extent.

2.2 Categorical Variables Selection

We identify influential categorical variables in five steps:

  1. Gather Nominal Features
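    After separating out the continuous numeric variables, we compile the list of categorical features. The “forced categorical” variables (such as MSSubClass and OverallQual) are treated as categorical here, which is why they appear in the test output.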

  2. Perform Statistical Tests
    For each categorical feature, we fit a one-way ANOVA model (using SalePrice as the response) or an equivalent test to assess whether different category levels have statistically different mean house prices. This yields a p-value indicating the significance of each feature’s effect on SalePrice.

  3. Apply Significance Threshold
    We filter out any categorical variables whose p-values exceed our cutoff of 0.05. Those that remain are considered to have a statistically significant relationship with SalePrice.

  4. Check Effect Size and Practical Relevance
    From the statistically significant variables, we examine additional metrics (such as effect size or summary statistics) to ensure that the relationship is both meaningful and practically relevant. Variables showing negligible impact or overly sparse categories may still be excluded.

  5. Finalize Key Predictors
    The result is a curated set of categorical features that show a statistically significant and practically relevant influence on housing prices. These final variables (including Neighborhood, Exterior1st, and Foundation) form the basis for our subsequent modeling and interpretation.

## Significant Variables (ANOVA p < 0.05):
##  [1] "MSSubClass"    "MSZoning"      "LotShape"      "LotConfig"    
##  [5] "Neighborhood"  "Condition1"    "BldgType"      "HouseStyle"   
##  [9] "OverallQual"   "OverallCond"   "RoofStyle"     "Exterior1st"  
## [13] "Exterior2nd"   "MasVnrType"    "ExterCond"     "Foundation"   
## [17] "BsmtExposure"  "BsmtFinType1"  "CentralAir"    "Electrical"   
## [21] "BedroomAbvGr"  "FireplaceQu"   "GarageType"    "GarageFinish" 
## [25] "PavedDrive"    "SaleType"      "SaleCondition"
## High Effect Variables (η² > 0.1):
## [1] "Neighborhood" "OverallQual"  "Exterior1st"  "Exterior2nd"  "MasVnrType"  
## [6] "Foundation"   "BsmtFinType1" "GarageType"   "GarageFinish"
##                   GVIF Df GVIF^(1/(2*Df))
## Neighborhood 50.947490 24        1.085338
## OverallQual   2.738410  1        1.654814
## Exterior1st   9.172182 14        1.082366
## MasVnrType    2.258565  3        1.145439
## Foundation    6.129657  5        1.198791
## BsmtFinType1  2.704414  5        1.104606
## GarageType    2.591109  5        1.099888
## GarageFinish  2.604354  2        1.270355
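The significance and effect-size screens can be sketched with base R's `aov()`. This is an illustrative sketch under stated assumptions, not the report's code: the data are synthetic, `anova_summary` is a hypothetical helper name, and η² is computed as the between-group sum of squares divided by the total sum of squares.

```r
# Synthetic data: SalePrice depends on Neighborhood but not on Street.
set.seed(2)
n <- 300
housing <- data.frame(
  Neighborhood = sample(c("A", "B", "C"), n, replace = TRUE),
  Street       = sample(c("Pave", "Grvl"), n, replace = TRUE),
  stringsAsFactors = TRUE
)
shift <- c(A = 0, B = 40000, C = 90000)
housing$SalePrice <- 150000 +
  unname(shift[as.character(housing$Neighborhood)]) +
  rnorm(n, 0, 25000)

# Hypothetical helper: one-way ANOVA p-value and eta squared for one feature.
anova_summary <- function(var, data, response = "SalePrice") {
  fit <- aov(reformulate(var, response = response), data = data)
  tab <- summary(fit)[[1]]          # rows: the feature, then Residuals
  ss  <- tab[["Sum Sq"]]
  c(p_value = tab[["Pr(>F)"]][1],
    eta_sq  = ss[1] / sum(ss))      # SS_between / SS_total
}

cat_vars <- c("Neighborhood", "Street")
results <- t(vapply(cat_vars, anova_summary, numeric(2), data = housing))
print(results)

# Apply the two thresholds used in this report.
significant <- rownames(results)[results[, "p_value"] < 0.05]
high_effect <- rownames(results)[results[, "p_value"] < 0.05 &
                                 results[, "eta_sq"] > 0.1]
print(high_effect)
```

Reporting η² alongside the p-value matters because, with many observations, a feature can be statistically significant while explaining very little of the variance in SalePrice.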

2.2.1 Key Categorical Variables Influencing House Prices

We selected the following categorical variables as key predictors of housing prices. Each passed three screens: a one-way ANOVA test (p < 0.05), an effect size threshold (η² > 0.1), and a GVIF check for severe multicollinearity (adjusted GVIF^(1/(2*Df)) < 2). We also considered practical relevance.

Variable      η²    Adjusted GVIF^(1/(2*Df))  Practical Relevance
Neighborhood  0.62  1.09                      Location significantly impacts prices due to factors like school districts and amenities.
OverallQual   0.75  1.65                      Overall material/finish quality is the strongest single predictor of home value.
Exterior1st   0.15  1.08                      Exterior covering material (e.g., brick, vinyl) affects curb appeal and durability.
MasVnrType    0.12  1.15                      Masonry veneer type (e.g., stone, brick) contributes to structural aesthetics.
Foundation    0.18  1.20                      Foundation type (e.g., poured concrete) impacts longevity and maintenance costs.
BsmtFinType1  0.11  1.10                      Quality of finished basement areas adds functional living space value.
GarageType    0.13  1.10                      Garage configuration (e.g., attached vs. detached) affects usability and convenience.
GarageFinish  0.19  1.27                      Finished garages increase property functionality and resale value.
  • Exterior2nd was removed because it is highly collinear with Exterior1st. The GVIF table above was computed after its removal, so it shows no GVIF value for Exterior2nd.
  • Variables like Utilities and Street were excluded for low variance (>95% single-category dominance).
  • All retained variables have η² > 0.1 and GVIF^(1/(2*Df)) < 2, ensuring both predictive power and model stability.
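The GVIF columns above follow the layout that `car::vif()` prints for models with multi-level factors, although the report does not show the call that produced them. As a hedged illustration of what the statistic measures, the base-R sketch below computes GVIF from its definition, det(R11) · det(R22) / det(R), where R is the correlation matrix of the design-matrix columns. The helper name `gvif_table` and the synthetic data are assumptions made for this example.

```r
# Hypothetical helper: GVIF per model term from the design-matrix correlations.
gvif_table <- function(formula, data) {
  X          <- model.matrix(formula, data)
  assign_idx <- attr(X, "assign")            # maps columns to terms (0 = intercept)
  term_names <- attr(terms(formula), "term.labels")
  keep       <- assign_idx > 0               # drop the intercept column
  X          <- X[, keep, drop = FALSE]
  assign_idx <- assign_idx[keep]
  R          <- cor(X)
  det_R      <- det(R)
  out <- t(vapply(seq_along(term_names), function(j) {
    cols <- which(assign_idx == j)
    gvif <- det(R[cols, cols, drop = FALSE]) *
            det(R[-cols, -cols, drop = FALSE]) / det_R
    df   <- length(cols)
    c(GVIF = gvif, Df = df, adj = gvif^(1 / (2 * df)))
  }, numeric(3)))
  rownames(out)    <- term_names
  colnames(out)[3] <- "GVIF^(1/(2*Df))"
  out
}

# Synthetic predictors: GrLivArea is made to depend on Neighborhood.
set.seed(3)
n <- 300
housing <- data.frame(
  Neighborhood = factor(sample(c("A", "B", "C", "D"), n, replace = TRUE)),
  GarageType   = factor(sample(c("Attchd", "Detchd", "None"), n, replace = TRUE))
)
housing$GrLivArea <- 1200 + 300 * (housing$Neighborhood == "D") + rnorm(n, 0, 250)

gv <- gvif_table(~ Neighborhood + GarageType + GrLivArea, housing)
print(round(gv, 3))

# Terms passing the report's threshold of adjusted GVIF < 2.
print(rownames(gv)[gv[, "GVIF^(1/(2*Df))"] < 2])
```

Raising GVIF to the power 1/(2·Df) makes terms with different numbers of dummy columns comparable. For a one-degree-of-freedom term, GVIF reduces to the ordinary VIF, 1/(1 − R²).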

3 Variables We Chose

Rank  Variable      Type         η²    Adjusted GVIF^(1/(2*Df))  Correlation with SalePrice    Practical Relevance
1     OverallQual   Categorical  0.75  1.65                      Strong positive               Overall material/finish quality is the strongest single predictor of home value.
2     GrLivArea     Numerical    -     -                         Strong positive (r = 0.71)    Above ground living area (sq ft) is a major determinant of price.
3     TotalBsmtSF   Numerical    -     -                         Strong positive (r = 0.61)    Total basement area adds significant usable space, impacting price.
4     Neighborhood  Categorical  0.62  1.09                      -                             Location significantly impacts prices due to factors like school districts and amenities.
5     GarageArea    Numerical    -     -                         Strong positive (r = 0.62)    Larger garage size contributes to convenience and value.
6     GarageCars    Numerical    -     -                         Strong positive (r = 0.64)    Number of garage spaces affects usability and desirability.
7     1stFlrSF      Numerical    -     -                         Strong positive (r = 0.61)    First-floor size is linked to living comfort and value.
8     GarageFinish  Categorical  0.19  1.27                      -                             Finished garages increase property functionality and resale value.
9     Foundation    Categorical  0.18  1.20                      -                             Foundation type (e.g., poured concrete) impacts longevity and maintenance costs.
10    FullBath      Numerical    -     -                         Moderate positive (r = 0.56)  Number of full bathrooms influences home value but is secondary to space.
11    TotRmsAbvGrd  Numerical    -     -                         Moderate positive (r = 0.53)  Total rooms above ground can add value but depends on layout and design.
12    Exterior1st   Categorical  0.15  1.08                      -                             Exterior material (e.g., brick, vinyl) affects curb appeal and durability.
13    GarageType    Categorical  0.13  1.10                      -                             Garage configuration (attached vs. detached) affects usability and appeal.
14    MasVnrType    Categorical  0.12  1.15                      -                             Masonry veneer type (e.g., stone, brick) contributes to structural aesthetics.
15    BsmtFinType1  Categorical  0.11  1.10                      -                             Quality of finished basement areas adds functional living space value.
16    YearBuilt     Numerical    -     -                         Moderate positive (r = 0.52)  Newer homes typically have higher prices but with variability.
17    MoSold        Categorical  -     -                         -                             Month sold (MM) helps analyze seasonality and sales trends.
18    YrSold        Categorical  -     -                         -                             Year sold (YYYY) is useful for observing long-term market trends.

MoSold and YrSold did not come out of the correlation or ANOVA screens above; we include them to analyze seasonality and time-series trends:

  • MoSold: Month Sold (MM) to observe seasonal patterns.
  • YrSold: Year Sold (YYYY) to track long-term housing market trends.

4 Visualization

4.1 Key Variables and Their Relationship to Sale Price

In the figures below, we focus on the selected variables that demonstrate a strong relationship with SalePrice. Notably, we exclude both YrSold and MoSold to emphasize the more impactful features in our dataset.

This composite figure shows 16 subplots illustrating both scatter plots (top row) and box plots (middle and bottom rows) for the selected variables. The scatter plots reveal how various numeric predictors (e.g., living area, basement area, garage size) trend with SalePrice (fitted by the red line), while the box plots capture how different categorical or discrete features (e.g., quality ratings, neighborhood, foundation type) influence sale prices. By examining these subplots together, we can identify which factors most strongly affect house prices and use that insight to guide further analysis.

4.2 Figure: Average SalePrice Over Time (SoldYM)

This line chart depicts the monthly average of house sale prices, providing a temporal perspective on how property values fluctuate across different months. The red points mark the mean price in each period, while the blue line highlights overall trends over time.

4.3 Location

4.3.1 Map: Neighborhood Median Sale Prices

Figure: Neighborhood Median Sale Prices
This interactive map displays each neighborhood’s median sale price using color-coded markers for different price brackets. Hover over a marker to see additional property details (e.g., living area, basement size, overall quality), offering a localized perspective on how housing values vary across the city.

4.3.2 Figure: Interactive Radar Charts

These radar plots allow for side-by-side comparisons of selected attributes—such as living area, basement size, overall quality, and more—across one or multiple neighborhoods. By toggling different neighborhoods on or off, you can visually contrast their strengths and weaknesses in each category, providing deeper insight into how these factors influence housing values.

4.4 Time

4.4.1 Figure: Monthly Median Housing Prices

This line chart tracks how the median sale price fluctuates across different months, offering insights into potential seasonal or cyclical patterns. Red points mark each month’s median price, while the connecting line highlights the overall trend throughout the year.
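The aggregation behind the time charts can be sketched in base R with `aggregate()`. The data below are synthetic, and the `SoldYM` column name is borrowed from the figure title in Section 4.2; how the report actually builds that key is not shown, so the date construction here is an assumption.

```r
# Synthetic sales records (made-up values, for illustration only).
set.seed(4)
n <- 500
sales <- data.frame(
  YrSold    = sample(2006:2009, n, replace = TRUE),
  MoSold    = sample(1:12, n, replace = TRUE),
  SalePrice = round(rnorm(n, 180000, 40000))
)

# Year-month key: the first day of the month in which the sale closed.
sales$SoldYM <- as.Date(sprintf("%d-%02d-01", sales$YrSold, sales$MoSold))

# Median price for each year-month (a time series across the whole period).
monthly <- aggregate(SalePrice ~ SoldYM, data = sales, FUN = median)
names(monthly)[2] <- "MedianPrice"

# Median price for each calendar month, pooled over years (seasonal view).
by_month <- aggregate(SalePrice ~ MoSold, data = sales, FUN = median)
names(by_month)[2] <- "MedianPrice"

print(head(monthly))
print(by_month)
```

Swapping `median` for `mean` in the `FUN` argument gives the monthly average instead. The median is less sensitive to a few very expensive sales in a thin month.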

4.4.2 Figure: Quarterly Price Distribution

This box plot groups sale prices by quarter (Q1 through Q4), illustrating how values fluctuate throughout the year. Each box captures the median and interquartile range, while outliers reflect unusually high or low transactions during that period.
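Grouping by quarter only needs the month of sale. The sketch below, on synthetic data, derives the quarter by integer division and uses `boxplot(..., plot = FALSE)` to obtain the box statistics and the flagged outliers without drawing. `month_to_quarter` is a hypothetical helper name.

```r
# Synthetic sales records (made-up values, for illustration only).
set.seed(5)
n <- 400
sales <- data.frame(
  MoSold    = sample(1:12, n, replace = TRUE),
  SalePrice = round(rlnorm(n, meanlog = 12, sdlog = 0.35))
)

# Map month 1-12 to quarter Q1-Q4 with integer division.
month_to_quarter <- function(month) {
  factor(paste0("Q", (month - 1) %/% 3 + 1), levels = paste0("Q", 1:4))
}
sales$Quarter <- month_to_quarter(sales$MoSold)

# plot = FALSE returns the box plot statistics without drawing anything.
bp <- boxplot(SalePrice ~ Quarter, data = sales, plot = FALSE)

# Rows of bp$stats: lower whisker, lower hinge, median, upper hinge, upper whisker.
quarter_stats <- data.frame(
  Quarter    = bp$names,
  LowerHinge = bp$stats[2, ],
  Median     = bp$stats[3, ],
  UpperHinge = bp$stats[4, ],
  n          = bp$n
)
print(quarter_stats)

# Points more than 1.5 box lengths beyond the hinges are flagged as outliers.
outliers <- data.frame(Quarter = bp$names[bp$group], SalePrice = bp$out)
print(nrow(outliers))
```

Dropping `plot = FALSE` draws the box plot itself. The hinges are close to, but not always identical to, the first and third quartiles.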

4.4.5 Figure: Economic Events and Their Impact on Housing Prices

This line chart highlights the broader timeline of median sale prices from 2006 to 2010, with key economic events labeled along the top. Spikes or dips in the trend line may coincide with these milestones, but timing alone does not establish cause and effect in the housing market.

4.4.5.1 Lag Effect of Lehman Brothers Bankruptcy

Marked by the red dashed line, this plot shows median sale prices in the months before and after the Lehman Brothers bankruptcy. The aim is to detect any delayed impact on housing values following this significant financial collapse.

4.4.5.2 Figure: Lag Effect of Homebuyer Tax Credit

Centered on the tax credit event (red dashed line), this chart tracks median sale prices to see whether the introduction of the incentive had an immediate or gradual influence on home values.

4.4.5.3 Figure: Lag Effect of Subprime Crisis

By aligning monthly median prices around the onset of the subprime crisis, this plot illustrates how housing values behaved just prior to—and in the aftermath of—this pivotal economic turning point.

4.5 Distributions of Key Variables

The top row shows histograms of various numeric features—such as GrLivArea, TotalBsmtSF, GarageArea, and X1stFlrSF—highlighting their frequency distributions. The middle row covers discrete or ordinal attributes like YearBuilt, OverallQual, and TotRmsAbvGrd, providing insights into how homes are spread across different quality ratings and room counts.

In the bottom row, pie charts break down the proportions of categorical features and low-cardinality counts (e.g., FullBath, GarageFinish, MasVnrType), revealing which categories dominate each variable. These visualizations help us understand the overall composition of the dataset and guide decisions about how best to handle each variable in further analysis.

4.6 Base R Time-Series STL Decomposition

This four-panel plot displays:

  1. The raw time series data (top panel).
  2. The seasonal component, highlighting recurring monthly fluctuations.
  3. The trend component, illustrating the long-term direction in house prices.
  4. The remainder (bottom panel), capturing short-term irregularities not explained by seasonality or trend.
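A four-panel plot of this kind is what base R's `stl()` produces when its result is passed to `plot()`. The sketch below runs it on a synthetic monthly series. The series length, the shape of the seasonal pattern, and the choice of `s.window = "periodic"` are assumptions, since the report does not state the settings it used.

```r
# Synthetic monthly median prices: trend + seasonal pattern + noise.
set.seed(6)
n_months <- 55
month_no <- (seq_len(n_months) - 1) %% 12 + 1
trend    <- seq(185000, 170000, length.out = n_months)
seasonal <- 8000 * sin(2 * pi * (month_no - 1) / 12)
median_price <- trend + seasonal + rnorm(n_months, 0, 3000)

# stl() needs a ts object with a seasonal frequency and at least two full cycles.
price_ts <- ts(median_price, start = c(2006, 1), frequency = 12)

# "periodic" forces the seasonal component to repeat identically every year.
fit <- stl(price_ts, s.window = "periodic")

# Four panels: data, seasonal, trend, remainder.
plot(fit)

# The components are also available as a multivariate time series.
components <- fit$time.series
print(head(components))
```

The decomposition is additive, so the seasonal, trend, and remainder columns sum back to the original series at every time point.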

4.6.1 ggplot STL Decomposition

Each component—Remainder, Seasonal, and Trend—is shown in a separate facet, with the x-axis representing time and the y-axis showing each component’s magnitude. The seasonal curve exhibits regular peaks and troughs over the year, the trend line reveals how median sale prices evolve over time, and the remainder indicates any short-term fluctuations once the seasonal and trend effects have been removed.

4.7 Scatterplot Matrix of Selected Numeric Variables (could change)

Each off-diagonal panel shows a scatterplot for a pair of numeric features (e.g., GrLivArea vs. GarageArea), while the diagonal panels display the univariate distribution of each variable (here, density plots). Correlation coefficients in the upper panels summarize the strength of each pairwise relationship, helping to identify potential outliers, clusters, and overall patterns in the data.

4.8 SalePrice vs. OverallQual by Neighborhood

This faceted box plot displays how the distribution of sale prices changes with different overall quality ratings (OverallQual) across various neighborhoods. Each facet corresponds to a specific neighborhood, helping you quickly see whether high-quality homes fetch significantly higher prices in certain areas compared to others. Red points mark potential outliers, indicating unusually high or low values within each category.

4.9 2D Density Plot: GrLivArea vs. TotalBsmtSF (could change)

4.10 Hex Bin Plot: GarageArea vs. YearBuilt (could change)

4.11 K-Means Clusters (k = 3) on GrLivArea vs. TotalBsmtSF (could change)

## 'data.frame':    1460 obs. of  4 variables:
##  $ GrLivArea  : num  1710 1262 1786 1717 2198 ...
##  $ TotalBsmtSF: num  856 1262 920 756 1145 ...
##  $ GarageArea : num  548 460 608 642 836 480 636 484 468 205 ...
##  $ OverallQual: num  7 6 7 7 8 5 8 7 7 5 ...
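A k-means run with k = 3 on the two variables named in the heading can be sketched with base R's `kmeans()`. The data here are synthetic. Standardizing the inputs with `scale()` and using `nstart = 25` are assumptions for this illustration; the report does not say whether either was applied.

```r
# Synthetic homes drawn from three size groups (made-up values).
set.seed(7)
cluster_data <- data.frame(
  GrLivArea   = c(rnorm(100, 1000, 120), rnorm(100, 1600, 150), rnorm(100, 2400, 200)),
  TotalBsmtSF = c(rnorm(100,  700, 100), rnorm(100, 1100, 120), rnorm(100, 1700, 150))
)

# Standardize so that both variables contribute equally to the distances.
scaled <- scale(cluster_data)

# nstart = 25 reruns the algorithm from 25 random starts and keeps the best fit.
km <- kmeans(scaled, centers = 3, nstart = 25)

# Attach the labels and summarize each cluster in the original units.
cluster_data$cluster <- factor(km$cluster)
cluster_means <- aggregate(. ~ cluster, data = cluster_data, FUN = mean)
print(cluster_means)
print(table(cluster_data$cluster))
```

Cluster labels are arbitrary, so the numbering can change between runs even when the grouping itself is the same.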

4.12 Parallel Coordinates Plot of K-Means Clusters (could change)

4.13 Clustered Neighborhoods Based on Housing Characteristics
